{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Wp6u2WFB6sD3"
   },
   "source": [
    "# Python Libraries\n",
    "\n",
    "For this tutorial, we are going to explore the python libraries that include functionality that corresponds with the material discussed in the course.\n",
    "\n",
    "The primary package we will be using is:\n",
    "\n",
    "* **Statsmodels:** a library that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, exploring data, and constructing models.  \n",
    "\n",
    "*__ATTN__: If you are not familiar with the following packages:*  \n",
    "\n",
    "* **Numpy** is a library for working with arrays of data.  \n",
    "\n",
    "* **Pandas** is a library for data management, manipulation, and analysis.  \n",
    "\n",
    "* **Matplotlib** is a library for making visualizations.  \n",
    "\n",
    "* **Seaborn** is a higher-level interface to Matplotlib that can be used to simplify many visualization tasks.  \n",
    "\n",
    "We recommend you check out the first and second courses of the Statistics with Python specialization, **Understanding and Visualizing Data** and **Inferential Statistical Analysis with Python**.\n",
    "\n",
    "*__Important__: While this notebooks provides insight into the basics of these libraries,  it is recommended that you dig into the documentation available online.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "bS09niUH6sD4"
   },
   "source": [
    "## StatsModels\n",
    "\n",
    "The StatsModels library is extremely extensive and includes functionality ranging from statistical methods to advanced topics such as regression, time-series analysis, and multivariate statistics.\n",
    "\n",
    "We will mainly be looking at the stats, OLS, and GLM sub-libraries.  However, we will begin by reviewing some functionality that has been referenced in earlier course of the Statistics with Python specialization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "zAmQwUrM6sD4"
   },
   "outputs": [],
   "source": [
    "import statsmodels.api as sm\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "yHW550yS6sD9"
   },
   "source": [
    "### Stats\n",
    "\n",
    "#### Descriptive Statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "T0Diw1p66sD-",
    "outputId": "38a8e936-209c-49de-d828-7dae305acaf8"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<statsmodels.stats.weightstats.DescrStatsW object at 0x7f4d6ed5ff28>\n"
     ]
    }
   ],
   "source": [
    "# Draw random variables from a normal distribution with numpy\n",
    "normalRandomVariables = np.random.normal(0,1, 1000)\n",
    "\n",
    "# Create object that has descriptive statistics as variables\n",
    "x = sm.stats.DescrStatsW(normalRandomVariables)\n",
    "\n",
    "print(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "O3VPrgBO6sEC"
   },
   "source": [
    "As you can see from the above output, we have created an object with type: \"statsmodels.stats.weightstats.DescrStatsW\".  \n",
    "\n",
    "This object stores various descriptive statistics such as mean, standard deviation, variance, ect. that we can access."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "t080a6ZB6sEE",
    "outputId": "b275eed9-d036-4212-bfcc-0c8a3c72f89e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.008118573821852588\n",
      "1.0205630300510322\n",
      "1.0415488983069439\n"
     ]
    }
   ],
   "source": [
    "# Mean\n",
    "print(x.mean)\n",
    "\n",
    "# Standard deviation\n",
    "print(x.std)\n",
    "\n",
    "# Variance\n",
    "print(x.var)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "DvveuP5x6sEH"
   },
   "source": [
    "The output above shows the mean, standard deviation, and variance of the 1000 random variables we drew from the distribution we generated above.\n",
    "\n",
    "There are other interesting things you can do with this object, such as generating confidence intervals and hypothesis testing.\n",
    "\n",
    "#### Confidence Intervals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "B0pWYUHz6sEJ",
    "outputId": "eda277fe-cc28-44ba-c4d2-f667dd9846dd"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.8227378265796143, 0.8772621734203857)"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Generate confidence interval for a population proportion\n",
    "\n",
    "tstar = 1.96\n",
    "\n",
    "# Observer population proportion\n",
    "p = .85\n",
    "\n",
    "# Size of population\n",
    "n = 659\n",
    "\n",
    "# Construct confidence interval\n",
    "sm.stats.proportion_confint(n * p, n)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "AuZGEYlG6sEM"
   },
   "source": [
    "The above output includes the lower and upper bounds of a 95% confidence interval of population proportion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "kwBfjY_y6sEN",
    "outputId": "475a8442-9cc0-4841-a8f3-0140583a8f6b"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(76.57715593233024, 88.38284406766977)"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Import data that will be used to construct confidence interval of population mean\n",
    "df = pd.read_csv(\"https://raw.githubusercontent.com/UMstatspy/UMStatsPy/master/Course_1/Cartwheeldata.csv\")\n",
    "\n",
    "# Generate confidence interval for a population mean\n",
    "sm.stats.DescrStatsW(df[\"CWDistance\"]).zconfint_mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "pk-nzwLg6sEP"
   },
   "source": [
    "The output above shows the lower and upper bounds of a 95% confidence interval of population mean.\n",
    "\n",
    "These functions should be familiar, if not, we recommend you take course 2 of our specialization.\n",
    "#### Hypothesis Testing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "aC6ZmdvT6sEQ",
    "outputId": "5f1243c9-34e5-4bda-9bda-1dfc8bc052b9"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2.571067795759113, 0.010138547731721065)"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# One population proportion hypothesis testing\n",
    "\n",
    "# Population size\n",
    "n = 1018\n",
    "\n",
    "# Null hypothesis population proportion\n",
    "pnull = .52\n",
    "\n",
    "# Observe population proportion\n",
    "phat = .56\n",
    "\n",
    "# Calculate test statistic and p-value\n",
    "sm.stats.proportions_ztest(phat * n, n, pnull)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "_tdgO3pL6sEU",
    "outputId": "37a8fa49-ba7e-4996-e1e6-85b80b22383b"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.8234523266982029, 0.20512540845395266)"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Using the dataframe imported above, perform a hypothesis test for population mean\n",
    "sm.stats.ztest(df[\"CWDistance\"], value = 80, alternative = \"larger\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Ko78IcAX6sEX"
   },
   "source": [
    "The outputs above are the test statistics and p-values from the respective hypothesis tests.\n",
    "\n",
    "If you'd like to review these functions on your own, the stats sub-library documentation can be found at the following url: https://www.statsmodels.org/stable/stats.html\n",
    "\n",
    "This concludes the review portion of this notebook,  now we are going to introduce the OLS and GLM sub-libraries and the functions you will be seeing throughout this course.\n",
    "\n",
    "# OLS (Ordinary Least Squares), GLM (Generalized Linear Models), GEE (Generalize Estimated Equations), MIXEDLM (Multilevel Models)\n",
    "\n",
    "The OLS, GLM, GEE, and MIXEDLM sub-libraries are the primary libraries in statsmodels that we will be utilizing in this course to create various models.\n",
    "\n",
    "Below, we will give a brief description of each model and a skeleton of the functions you will see going forward in the course.  This is simply for you to get familiar with these concepts and to prepare you for the coming weeks.  If their application at this time seems a bit ambigious have no fear as they will be discussed in detail throughout this course!\n",
    "\n",
    "For each of the following models, we follow our similar structure which means we will be following our structure of Dependent and Independent Variables, with a few caveats that will be expressed below.\n",
    "\n",
    "#### Ordinary Least Squares\n",
    "\n",
    "Ordinary Least Squares is a method for estimating the unknown parameters in a linear regression model.  This is the function we will use when our target variable is continuous."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "MhalCiu-6sEX"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                 BPXSY1   R-squared:                       0.215\n",
      "Model:                            OLS   Adj. R-squared:                  0.214\n",
      "Method:                 Least Squares   F-statistic:                     697.4\n",
      "Date:                Sun, 07 Jul 2019   Prob (F-statistic):          1.87e-268\n",
      "Time:                        20:01:24   Log-Likelihood:                -21505.\n",
      "No. Observations:                5102   AIC:                         4.302e+04\n",
      "Df Residuals:                    5099   BIC:                         4.304e+04\n",
      "Df Model:                           2                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "=====================================================================================\n",
      "                        coef    std err          t      P>|t|      [0.025      0.975]\n",
      "-------------------------------------------------------------------------------------\n",
      "Intercept           100.6305      0.712    141.257      0.000      99.234     102.027\n",
      "RIAGENDRx[T.Male]     3.2322      0.459      7.040      0.000       2.332       4.132\n",
      "RIDAGEYR              0.4739      0.013     36.518      0.000       0.448       0.499\n",
      "==============================================================================\n",
      "Omnibus:                      706.732   Durbin-Watson:                   2.036\n",
      "Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1582.730\n",
      "Skew:                           0.818   Prob(JB):                         0.00\n",
      "Kurtosis:                       5.184   Cond. No.                         168.\n",
      "==============================================================================\n",
      "\n",
      "Warnings:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
     ]
    }
   ],
   "source": [
    "da = pd.read_csv(\"nhanes_2015_2016.csv\")\n",
    "\n",
    "# Drop unused columns, drop rows with any missing values.\n",
    "vars = [\"BPXSY1\", \"RIDAGEYR\", \"RIAGENDR\", \"RIDRETH1\", \"DMDEDUC2\", \"BMXBMI\",\n",
    "        \"SMQ020\", \"SDMVSTRA\", \"SDMVPSU\"]\n",
    "da = da[vars].dropna()\n",
    "\n",
    "da[\"RIAGENDRx\"] = da.RIAGENDR.replace({1: \"Male\", 2: \"Female\"})\n",
    "\n",
    "model = sm.OLS.from_formula(\"BPXSY1 ~ RIDAGEYR + RIAGENDRx\", data=da)\n",
    "res = model.fit()\n",
    "print(res.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ujGjtQwR6sEa"
   },
   "source": [
    "The above code is creating a multiple linear regression where the target variable is BPXSY1 and the two predictor variables are RIDAGEYR and RIAGENDRx.\n",
    "\n",
    "Note that the target variable, BPXSY1, is a continous variable that represents blood pressure."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "t7mrqEjx6sEb"
   },
   "source": [
    "#### Generalized Linear Models\n",
    "\n",
    "While generalized linear models are a broad topic, in this course we will be using this suite of functions to carry out logistic regression.  Logistic regression is used when our target variable is a binary outcome, or a classification of two groups, which can be denoted as group 0 and group 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "66gX6Ivv6sEb"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                 Generalized Linear Model Regression Results                  \n",
      "==============================================================================\n",
      "Dep. Variable:                    smq   No. Observations:                 5094\n",
      "Model:                            GLM   Df Residuals:                     5092\n",
      "Model Family:                Binomial   Df Model:                            1\n",
      "Link Function:                  logit   Scale:                          1.0000\n",
      "Method:                          IRLS   Log-Likelihood:                -3350.6\n",
      "Date:                Sun, 07 Jul 2019   Deviance:                       6701.2\n",
      "Time:                        20:03:20   Pearson chi2:                 5.09e+03\n",
      "No. Iterations:                     4   Covariance Type:             nonrobust\n",
      "=====================================================================================\n",
      "                        coef    std err          z      P>|z|      [0.025      0.975]\n",
      "-------------------------------------------------------------------------------------\n",
      "Intercept            -0.7547      0.042    -18.071      0.000      -0.837      -0.673\n",
      "RIAGENDRx[T.Male]     0.8851      0.058     15.227      0.000       0.771       0.999\n",
      "=====================================================================================\n"
     ]
    }
   ],
   "source": [
    "da[\"smq\"] = da.SMQ020.replace({2: 0, 7: np.nan, 9: np.nan})\n",
    "model = sm.GLM.from_formula(\"smq ~ RIAGENDRx\", family=sm.families.Binomial(), data=da)\n",
    "res = model.fit()\n",
    "print(res.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "HAGWs0sA6sEg"
   },
   "source": [
    "Above is a example of creating a logistic model where the target value is SMQ020x, which in this case is whether or not this person is a smoker or not.  The predictor is RIAGENDRx, which is gender."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "HPr2J3Xu6sEh"
   },
   "source": [
    "#### Generalized Estimated Equations\n",
    "\n",
    "Generalized Estimating Equations estimate generalized linear models for panel, cluster or repeated measures data when the observations are possibly correlated within a cluster but uncorrelated across clusters.  These are used primarily when there is uncertainty regarding correlation between outcomes. \"Generalized Estimating Equations\" (GEE) fit marginal linear models, and estimate intraclass correlation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SDMVSTRA</th>\n",
       "      <th>SDMVPSU</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>125</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>125</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>131</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>131</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>126</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   SDMVSTRA  SDMVPSU\n",
       "0       125        1\n",
       "1       125        1\n",
       "2       131        1\n",
       "3       131        1\n",
       "4       126        2"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "da[['SDMVSTRA', 'SDMVPSU']].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "N2UPibR56sEi"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The correlation between two observations in the same cluster is 0.030\n"
     ]
    }
   ],
   "source": [
    "da[\"group\"] = 10*da.SDMVSTRA + da.SDMVPSU\n",
    "model = sm.GEE.from_formula(\"BPXSY1 ~ 1\", groups=\"group\", cov_struct=sm.cov_struct.Exchangeable(), data=da)\n",
    "res = model.fit()\n",
    "print(res.cov_struct.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "GI-wvJZz6sEm"
   },
   "source": [
    "Here we are creating a marginal linear model of BPXSY1 to determine the estimated ICC value, which would indicate whether or not there are correlated clusters of BPXSY1."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "qTDJVIpQ6sEn"
   },
   "source": [
    "#### Multilevel Models\n",
    "\n",
    "Similarly to GEEs, we use multilevel models when there is potential for outcomes to be grouped together which is not uncommon when using various sampling methods to collect data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "kx8UyI4b6sEo"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "BPXSY1 The correlation between two observations in the same cluster is 0.030\n",
      "RIDAGEYR The correlation between two observations in the same cluster is 0.035\n",
      "BMXBMI The correlation between two observations in the same cluster is 0.039\n",
      "smq The correlation between two observations in the same cluster is 0.026\n",
      "SDMVSTRA The correlation between two observations in the same cluster is 0.959\n"
     ]
    }
   ],
   "source": [
    "for v in [\"BPXSY1\", \"RIDAGEYR\", \"BMXBMI\", \"smq\", \"SDMVSTRA\"]:\n",
    "    model = sm.GEE.from_formula(v + \" ~ 1\", groups=\"group\",\n",
    "           cov_struct=sm.cov_struct.Exchangeable(), data=da)\n",
    "    result = model.fit()\n",
    "    print(v, result.cov_struct.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "oCAM_2cf6sEt"
   },
   "source": [
    "What;s nice about the statsmodels library is that all the models follow the similar structure and syntax.  \n",
    "\n",
    "\n",
    "Documentation and examples of these models can be found at the following links:\n",
    "\n",
    "* OLS: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html\n",
    "\n",
    "* GLM: https://www.statsmodels.org/stable/glm.html\n",
    "\n",
    "* GEE: https://www.statsmodels.org/stable/gee.html\n",
    "\n",
    "* MIXEDLM: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html\n",
    "\n",
    "Feel free to read up on these sub-libraries and their use cases.  In week 2 you will see examples of OLS and GLM, where in week 3, we will be implementing GEE and MIXEDLM."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "Important Python Libraries - Fitting Statistical Models to Data with Python.ipynb",
   "provenance": [],
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
